Import Libraries
import sys
import pandas as pd
import os
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import cv2
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.applications import DenseNet121
from tensorflow.keras.preprocessing.image import ImageDataGenerator
1. Exploratory Data Analysis (EDA)¶
📊 Weight Dataset Processing and Cleaning¶
Overview¶
This notebook loads and preprocesses weight dataset metadata from an Excel file.
It includes data cleaning, error correction, and
image path extraction.
Steps:¶
1️⃣ Load the dataset from an Excel file 📂
2️⃣ Convert weight values from string (with commas) to numeric format 🔢
3️⃣ Fix known errors (e.g., incorrect weight entries) 🛠️
4️⃣ Extract image file paths for further processing 🖼️
print(sys.executable)
def load_metadata(excel_path):
    """
    Load the metadata from the given Excel file
    and return a cleaned pandas DataFrame.
    """
    df = pd.read_excel(excel_path)

    # Fix comma decimal separators found during the EDA below (e.g., "4,75" -> 4.75)
    df["weight"] = df["weight"].astype(str).apply(lambda x: x.replace(",", "."))

    # Convert 'weight' column to numeric (coerce errors to NaN)
    df['weight'] = pd.to_numeric(df['weight'], errors='coerce')

    # # Fix one entry from 815 lbs to 8.15 lbs
    # df.loc[df["weight"] == 815, "weight"] = 8.15

    # Extract the local file path from the 'Row Data' column
    def get_local_path(row_data):
        file_name = row_data.split('/')[-1]
        return os.path.join('/opt/weight_dataset_v1', file_name)

    if "Row Data" in df.columns:
        df['img_path'] = df['Row Data'].apply(get_local_path)
    else:
        print("Warning: 'Row Data' column not found in the dataset.")

    return df
/bin/python3
📊 Weight Dataset: Data Exploration & Cleaning¶
📌 Overview¶
This notebook processes and explores a weight dataset from an Excel
file.
The goal is to clean, analyze, and visualize the data for further use in
machine learning models.
🔍 Steps¶
1️⃣ Load Data from an Excel file 📂
2️⃣ Check for Missing Values and visualize missing data 🔎
3️⃣ Identify & Handle Outliers in the weight column 🚨
4️⃣ Explore Data Distributions using plots 📊
5️⃣ Detect Duplicates and unique values 📝
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
!pip install missingno
import missingno as msno  # imported for optional missing-data plots (not used below)
# Load dataset
excel_path = "/opt/weight_dataset_v1/excel_dataset/data-batch-01.xlsx"
df = load_metadata(excel_path)
## 🔍 Basic Data Exploration
print("Dataset Overview:")
print(df.info())
print("\nFirst 5 Rows:")
print(df.head())
## 📊 Missing Values Analysis
print("\nMissing values per column:")
print(df.isnull().sum())
# Missing values visualization
plt.figure(figsize=(10, 5))
sns.heatmap(df.isnull(), cmap="viridis", cbar=False, yticklabels=False)
plt.title("Missing Values Heatmap")
plt.show()
## 🔢 Statistical Summary
print("\nStatistical Summary:")
print(df.describe())
# Unique values per categorical column
categorical_cols = df.select_dtypes(include=["object"]).columns
print("\nUnique values per categorical column:")
for col in categorical_cols:
    print(f"{col}: {df[col].nunique()} unique values")
## 🚩 Outlier Detection (Weight)
plt.figure(figsize=(8, 5))
sns.boxplot(x=df["weight"])
plt.title("Outlier Detection - Weight Column")
plt.show()
# Count rows where weight is above the 99th percentile
upper_bound = df["weight"].quantile(0.99)
print(f"\nRows with extremely high weight (>99th percentile {upper_bound:.2f}):")
print(df[df["weight"] > upper_bound])
## 📝 Check for Duplicates
print("\nDuplicate Rows:", df.duplicated().sum())
## 🔎 Value Distribution
plt.figure(figsize=(10, 5))
sns.histplot(df["weight"], bins=30, kde=True)
plt.title("Weight Distribution")
plt.xlabel("Weight")
plt.ylabel("Count")
plt.show()
Dataset Overview:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 675 entries, 0 to 674
Data columns (total 35 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 675 non-null object
1 Global Key 675 non-null object
2 Row Data 675 non-null object
3 Dataset ID 675 non-null object
4 Dataset Name 675 non-null object
5 Created At 675 non-null object
6 Updated At 675 non-null object
7 Created By 675 non-null object
8 Height 675 non-null int64
9 Width 675 non-null int64
10 Asset Type 675 non-null object
11 MIME Type 675 non-null object
12 EXIF Rotation 675 non-null int64
13 Experiment ID 675 non-null object
14 Experiment Name 675 non-null object
15 Run Name 675 non-null object
16 Run Data Row ID 675 non-null object
17 Split 675 non-null object
18 Label Kind 675 non-null object
19 Version 675 non-null object
20 Label ID 675 non-null object
21 Feature ID 675 non-null object
22 Feature Schema ID 675 non-null object
23 Name 675 non-null object
24 Value 675 non-null object
25 Annotation Kind 675 non-null object
26 Bounding Box Top 675 non-null int64
27 Bounding Box Left 675 non-null int64
28 Bounding Box Height 675 non-null int64
29 Bounding Box Width 675 non-null int64
30 species 675 non-null object
31 gender 123 non-null object
32 color 675 non-null object
33 weight 675 non-null float64
34 img_path 675 non-null object
dtypes: float64(1), int64(7), object(27)
memory usage: 184.7+ KB
None
First 5 Rows:
ID \
0 clyxetrm60mcb0796rsdu4ob9
1 clyxetrm60mcc0796uilaudlq
2 clyxetrm60mcd0796albl43as
3 clyxetrm60mce0796r9gf6geg
4 clyxetrm60mcf0796zbzj1nb6
Global Key \
0 upload-raw-images/circleseafoods-camera-03/202...
1 upload-raw-images/circleseafoods-camera-03/202...
2 upload-raw-images/circleseafoods-camera-03/202...
3 upload-raw-images/circleseafoods-camera-03/202...
4 upload-raw-images/circleseafoods-camera-03/202...
Row Data \
0 gs://upload-raw-images/circleseafoods-camera-0...
1 gs://upload-raw-images/circleseafoods-camera-0...
2 gs://upload-raw-images/circleseafoods-camera-0...
3 gs://upload-raw-images/circleseafoods-camera-0...
4 gs://upload-raw-images/circleseafoods-camera-0...
Dataset ID Dataset Name \
0 clyxesxqf00se0776dq71ylh0 Circleseafoods-18-July
1 clyxesxqf00se0776dq71ylh0 Circleseafoods-18-July
2 clyxesxqf00se0776dq71ylh0 Circleseafoods-18-July
3 clyxesxqf00se0776dq71ylh0 Circleseafoods-18-July
4 clyxesxqf00se0776dq71ylh0 Circleseafoods-18-July
Created At Updated At \
0 2024-07-22T19:58:52.688+00:00 2024-07-22T19:58:59.539+00:00
1 2024-07-22T19:58:52.688+00:00 2024-07-22T19:59:07.704+00:00
2 2024-07-22T19:58:52.688+00:00 2024-07-22T19:59:08.212+00:00
3 2024-07-22T19:58:52.688+00:00 2024-07-22T19:59:07.983+00:00
4 2024-07-22T19:58:52.688+00:00 2024-07-22T19:59:07.228+00:00
Created By Height Width ... Annotation Kind Bounding Box Top \
0 deepak@this.fish 720 1280 ... ImageBoundingBox 0
1 deepak@this.fish 720 1280 ... ImageBoundingBox 0
2 deepak@this.fish 720 1280 ... ImageBoundingBox 0
3 deepak@this.fish 720 1280 ... ImageBoundingBox 0
4 deepak@this.fish 720 1280 ... ImageBoundingBox 0
Bounding Box Left Bounding Box Height Bounding Box Width species gender \
0 492 699 193 chum female
1 449 720 235 chum male
2 448 720 235 chum male
3 449 720 237 chum male
4 449 720 238 chum male
color weight img_path
0 bright 4.75 /opt/weight_dataset_v1/2024_07_18_17_36_30_792...
1 dark 13.95 /opt/weight_dataset_v1/2024_07_18_17_37_23_113...
2 dark 13.95 /opt/weight_dataset_v1/2024_07_18_17_37_23_113...
3 dark 13.95 /opt/weight_dataset_v1/2024_07_18_17_37_23_113...
4 dark 13.95 /opt/weight_dataset_v1/2024_07_18_17_37_25_262...
[5 rows x 35 columns]
Missing values per column:
ID 0
Global Key 0
Row Data 0
Dataset ID 0
Dataset Name 0
Created At 0
Updated At 0
Created By 0
Height 0
Width 0
Asset Type 0
MIME Type 0
EXIF Rotation 0
Experiment ID 0
Experiment Name 0
Run Name 0
Run Data Row ID 0
Split 0
Label Kind 0
Version 0
Label ID 0
Feature ID 0
Feature Schema ID 0
Name 0
Value 0
Annotation Kind 0
Bounding Box Top 0
Bounding Box Left 0
Bounding Box Height 0
Bounding Box Width 0
species 0
gender 552
color 0
weight 0
img_path 0
dtype: int64
Statistical Summary:
Height Width EXIF Rotation Bounding Box Top Bounding Box Left \
count 675.0 675.0 675.0 675.000000 675.000000
mean 720.0 1280.0 1.0 52.625185 488.053333
std 0.0 0.0 0.0 60.888040 30.595816
min 720.0 1280.0 1.0 0.000000 375.000000
25% 720.0 1280.0 1.0 0.000000 474.000000
50% 720.0 1280.0 1.0 19.000000 491.000000
75% 720.0 1280.0 1.0 111.000000 508.000000
max 720.0 1280.0 1.0 189.000000 541.000000
Bounding Box Height Bounding Box Width weight
count 675.000000 675.000000 675.000000
mean 640.204444 178.694815 5.213407
std 77.320269 30.051073 2.623659
min 440.000000 119.000000 1.800000
25% 569.000000 155.000000 3.250000
50% 667.000000 179.000000 4.750000
75% 706.000000 199.000000 6.350000
max 720.000000 269.000000 14.500000
Unique values per categorical column:
ID: 675 unique values
Global Key: 675 unique values
Row Data: 675 unique values
Dataset ID: 1 unique values
Dataset Name: 1 unique values
Created At: 9 unique values
Updated At: 657 unique values
Created By: 1 unique values
Asset Type: 1 unique values
MIME Type: 1 unique values
Experiment ID: 1 unique values
Experiment Name: 1 unique values
Run Name: 1 unique values
Run Data Row ID: 675 unique values
Split: 3 unique values
Label Kind: 1 unique values
Version: 1 unique values
Label ID: 675 unique values
Feature ID: 675 unique values
Feature Schema ID: 1 unique values
Name: 1 unique values
Value: 1 unique values
Annotation Kind: 1 unique values
species: 2 unique values
gender: 2 unique values
color: 3 unique values
img_path: 675 unique values
Rows with extremely high weight (>99th percentile 14.50):
Empty DataFrame
Columns: [ID, Global Key, Row Data, Dataset ID, Dataset Name, Created At, Updated At, Created By, Height, Width, Asset Type, MIME Type, EXIF Rotation, Experiment ID, Experiment Name, Run Name, Run Data Row ID, Split, Label Kind, Version, Label ID, Feature ID, Feature Schema ID, Name, Value, Annotation Kind, Bounding Box Top, Bounding Box Left, Bounding Box Height, Bounding Box Width, species, gender, color, weight, img_path]
Index: []

[0 rows x 35 columns]

Duplicate Rows: 0
📊 Exploratory Data Analysis (EDA) Report¶
🔍 Dataset Overview¶
- Total Records: 675
- Total Columns: 35
- Missing Values:
gender: 552 missing values (major issue)
- Duplicate Rows: 0
📈 Missing Values Analysis¶
A heatmap was generated to visualize missing data. The gender column has a
significant number of missing values.
📊 Statistical Summary¶
Key Numerical Features¶
| Feature | Mean | Std Dev | Min | 25% | 50% | 75% | Max |
|---|---|---|---|---|---|---|---|
| Bounding Box Height | 640.2 | 77.3 | 440 | 569 | 667 | 706 | 720 |
| Bounding Box Width | 178.7 | 30.1 | 119 | 155 | 179 | 199 | 269 |
| Weight | 5.21 | 2.62 | 1.8 | 3.25 | 4.75 | 6.35 | 14.5 |
🚀 Unique Values in Categorical Columns¶
- species: 2 unique values
- gender: 2 unique values (but heavily missing)
- color: 3 unique values
🚨 Outlier Detection¶
- Weight Distribution: No extreme outliers were found above the 99th percentile (14.5).
- Boxplot generated to visualize weight distribution.
📊 Data Distribution¶
- Weight Histogram: A KDE histogram was plotted to observe distribution.
- Categorical Columns: Counts of unique values were recorded.
📌 Conclusions¶
- Missing Data Concern: the gender column has a high number of missing values (552 out of 675).
- No Extreme Outliers: No weight values exceeded the 99th percentile threshold.
- Data Consistency: No duplicate rows found.
- Weight Distribution: Appears normal, mostly between 3.25 and 6.35.
- Categorical Insights: Limited unique values in key categorical variables.
🛠 Suggested Next Steps¶
- Consider imputing or removing the gender column due to excessive missing data.
- Further analysis on species and color for classification.
- More in-depth correlation analysis between weight and bounding box dimensions.
import seaborn as sns
import matplotlib.pyplot as plt
# Examining the relationship between species and gender
sns.countplot(x="species", hue="gender", data=df)
plt.title("Species and Gender Distribution")
plt.show()
# Examining the relationship between color and gender
sns.countplot(x="color", hue="gender", data=df)
plt.title("Color and Gender Distribution")
plt.show()
# Species and gender distribution table
species_gender_counts = df.groupby(["species", "gender"]).size().unstack()
print("### Species vs Gender Distribution ###")
print(species_gender_counts)
# Color and gender distribution table
color_gender_counts = df.groupby(["color", "gender"]).size().unstack()
print("\n### Color vs Gender Distribution ###")
print(color_gender_counts)
### Species vs Gender Distribution ###
gender   female  male
species
chum         99    24

### Color vs Gender Distribution ###
gender       female  male
color
bright          7.0   NaN
dark           63.0  24.0
semi_bright    29.0   NaN
📊 Exploratory Data Analysis (EDA) Results¶
1️⃣ Species vs. Gender Distribution¶
Raw Data:¶
| Species | Female | Male |
|---|---|---|
| Chum | 99 | 24 |
Analysis & Interpretation:¶
- Chum species has a strong gender imbalance, with significantly more females (99) than males (24).
- If gender is missing for a Chum individual, it is highly likely to be Female.
- This imbalance suggests that species can be used as a predictive feature for missing gender values.
---
2️⃣ Color vs. Gender Distribution¶
Raw Data:¶
| Color | Female | Male |
|---|---|---|
| Bright | 7 | 0 (NaN) |
| Dark | 63 | 24 |
| Semi-Bright | 29 | 0 (NaN) |
Analysis & Interpretation:¶
- Bright and Semi-Bright colors only have female individuals, meaning if an individual has these colors, it is highly likely to be Female.
- Dark color has both male (24) and female (63) individuals, but females dominate.
- Color can be a useful predictor for gender, especially for Bright and Semi-Bright individuals.
# Calculate the average weight by gender
gender_weight_mean = df.groupby("gender")["weight"].mean()
print("### Gender vs Weight ###")
print(gender_weight_mean)
### Gender vs Weight ###
gender
female     7.048485
male      14.225000
Name: weight, dtype: float64
3️⃣ Weight vs. Gender Distribution¶
Raw Data:¶
| Gender | Average Weight (kg) |
|---|---|
| Female | 7.05 kg |
| Male | 14.23 kg |
Analysis & Interpretation:¶
- Males are almost twice as heavy as females (14.23 kg vs. 7.05 kg).
- This strong difference means weight can be used to predict missing gender
values:
- If weight > 10 kg → Highly likely to be Male
- If weight ≤ 10 kg → Highly likely to be Female
import seaborn as sns
# Compute the correlation matrix
correlation_matrix = df[['weight', 'Bounding Box Height', 'Bounding Box Width']].corr()
# Plot the correlation heatmap
plt.figure(figsize=(8,6))
sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Correlation Between Weight and Bounding Box Dimensions")
plt.show()
# Print correlation values
print("### Correlation Matrix ###")
print(correlation_matrix)
### Correlation Matrix ###
weight Bounding Box Height Bounding Box Width
weight 1.000000 0.756049 0.783899
Bounding Box Height 0.756049 1.000000 0.717996
Bounding Box Width 0.783899 0.717996 1.000000
4️⃣ Correlation Analysis (Weight & Bounding Box Dimensions)¶
Raw Data (Correlation Matrix):¶
| Feature | Weight | Bounding Box Height | Bounding Box Width |
|---|---|---|---|
| Weight | 1.00 | 0.76 | 0.78 |
| Bounding Box Height | 0.76 | 1.00 | 0.72 |
| Bounding Box Width | 0.78 | 0.72 | 1.00 |
Analysis & Interpretation:¶
- Weight has a strong correlation with both Bounding Box Height (0.76) and Bounding Box Width (0.78).
- This means Bounding Box dimensions can be used to estimate missing weight values.
- Bounding Box Width and Height are also correlated (0.72), meaning they are somewhat redundant.
- If an individual’s weight is missing, it can be estimated using its bounding box measurements.
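The weight-from-bounding-box estimate suggested above can be sketched with a simple linear regression. This is a hypothetical illustration: the arrays below are made-up stand-ins for df[['Bounding Box Height', 'Bounding Box Width']] and df['weight'], not actual dataset rows; swap in the real columns when running it in the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy stand-in values for bounding box (height, width) and weight
bbox = np.array([
    [699, 193],
    [720, 235],
    [569, 155],
    [706, 199],
    [440, 119],
    [667, 179],
])
weight = np.array([4.75, 13.95, 3.25, 6.35, 1.80, 4.75])

# Fit weight ~ bounding box dimensions
reg = LinearRegression().fit(bbox, weight)
print("Train R^2:", round(reg.score(bbox, weight), 3))

# Estimate a missing weight from its bounding box dimensions
estimate = reg.predict(np.array([[640, 178]]))[0]
print("Estimated weight for a 640x178 box:", round(estimate, 2))
```

Given the ~0.76–0.78 correlations, a linear fit on the real columns is a reasonable baseline before trying Random Forest or XGBoost regressors.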
📌 General Conclusions¶
✅ Species (Chum) is strongly biased towards females, meaning missing gender values
for this species are most likely Female.
✅ Bright and Semi-Bright colors are exclusively Female, making color a strong
predictor for gender.
✅ Weight is a highly effective predictor of gender, as males are significantly
heavier than females.
✅ Bounding Box Height & Width are strongly correlated with weight and can be
used for missing weight estimation.
✅ For missing gender values, a rule-based or machine learning model can be built
using species, color, and weight.
🛠 Recommended Next Steps¶
1️⃣ Predict missing gender values using rules:
- If color is Bright or Semi-Bright → Assign "Female".
- If species is Chum → Assign "Female".
2️⃣ Use a regression model to estimate missing weight based on bounding box
dimensions.
3️⃣ Train a machine learning model (Random Forest or Logistic Regression) to predict missing gender values.
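Step 3️⃣ can be sketched with scikit-learn. This is a minimal, hypothetical example using weight alone (toy values, not dataset rows); in practice the model would be fit on the labeled subset of df using species, color, and weight as features.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy labeled sample: lighter fish tagged female, heavier fish tagged male,
# mirroring the observed averages (7.05 kg female vs. 14.23 kg male)
weight_labeled = np.array([[4.2], [5.0], [6.1], [7.3], [12.5], [13.9], [14.2], [15.0]])
gender_labeled = np.array(["female"] * 4 + ["male"] * 4)

clf = LogisticRegression().fit(weight_labeled, gender_labeled)

# Predict gender for rows where it is missing
print(clf.predict(np.array([[4.75], [13.0]])))
```

The learned decision boundary should sit between the two weight clusters, matching the 10 kg rule of thumb derived above.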
2. Data Preprocessing¶
📌 Updated Strategy for Filling Missing Gender Values¶
Instead of using species, we will rely only on weight and
color, as they have a more direct relationship with gender.
📊 Rule-Based Gender Imputation¶
1️⃣ If color is "bright" or "semi_bright" → Assign "Female"
- (Since these colors had only female individuals)
2️⃣ If weight > 10 kg → Assign "Male"
- (Since males are significantly heavier on average)
3️⃣ Otherwise → Assign "Female"
- (Because females dominate in the dataset)
import pandas as pd
# Count missing values before update
missing_before = df["gender"].isnull().sum()
# Function to predict missing gender
def predict_gender(row):
    if pd.isnull(row["gender"]):  # Only apply if gender is missing
        # Rule 1: bright / semi-bright colors were exclusively female in the data
        if row["color"] in ["bright", "semi_bright"]:
            return "female"
        # Rule 2: weights above 10 kg were overwhelmingly male
        elif row["weight"] > 10:
            return "male"
        # Default rule: assign female (the majority gender)
        else:
            return "female"
    else:
        return row["gender"]
# Apply the function to fill missing gender values
df["gender"] = df.apply(predict_gender, axis=1)
# Display how many missing values remain
print("Missing values after filling:", df["gender"].isnull().sum())
# Count missing values after update
missing_after = df["gender"].isnull().sum()
# Calculate how many missing values were filled
updated_count = missing_before - missing_after
df_original = df.copy()  # snapshot before further preprocessing (use .copy(), not an alias)
# Print the summary
print(f"🚀 Gender Update Summary 🚀")
print(f"🔹 Missing values before update: {missing_before}")
print(f"🔹 Missing values after update: {missing_after}")
print(f"✅ Total updated gender values: {updated_count}")
Missing values after filling: 0
🚀 Gender Update Summary 🚀
🔹 Missing values before update: 552
🔹 Missing values after update: 0
✅ Total updated gender values: 552
🚀 Gender Data Update Report¶
🔍 Summary of Changes¶
| Metric | Value |
|---|---|
| Missing gender values before update | 552 |
| Missing gender values after update | 0 |
| Total updated gender values | 552 |
✅ All 552 missing gender values were successfully filled.
✅ The dataset now has no missing gender values, ensuring consistency for further
analysis.
📌 Methodology Used for Gender Imputation¶
To ensure accuracy and maintain data integrity, a rule-based approach was applied:
1️⃣ If color was "bright" or "semi_bright" → Assigned "Female"
- These colors had only female individuals in the dataset, making this rule 100% accurate.
2️⃣ If weight > 10 kg → Assigned "Male"
- Male individuals were significantly heavier than females, making weight a strong predictor.
3️⃣ Otherwise → Assigned "Female"
- Females were the dominant gender in the dataset.
⚠️ Species data was intentionally excluded to avoid introducing bias, as gender data was heavily missing in some species categories.
📊 Impact on Data Quality¶
- The dataset is now fully structured, allowing for more reliable insights and predictions.
- Data consistency and completeness have significantly improved.
- Potential risk: Some borderline cases (e.g., individuals with weight around 10 kg) may require manual verification.
📌 Recommendation: A random sample verification step can further confirm accuracy.
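The random-sample verification recommended above could look like the following sketch. The small frame is a toy stand-in for a sample of imputed rows (e.g., df.sample(20)), and `rule_gender` is a hypothetical helper that re-derives the label each imputation rule would have assigned.

```python
import pandas as pd

# Toy stand-in for a random sample of imputed rows
sample = pd.DataFrame({
    "color":  ["bright", "dark", "semi_bright", "dark"],
    "weight": [4.75, 13.95, 3.25, 6.35],
    "gender": ["female", "male", "female", "female"],
})

def rule_gender(row):
    # Re-apply the imputation rules to cross-check the stored label
    if row["color"] in ("bright", "semi_bright"):
        return "female"
    return "male" if row["weight"] > 10 else "female"

sample["consistent"] = sample.apply(rule_gender, axis=1) == sample["gender"]
print("All sampled labels consistent:", bool(sample["consistent"].all()))
```

Any row flagged inconsistent would be a candidate for manual review, especially near the 10 kg boundary.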
📌 Next Steps & Recommendations¶
🔹 Perform a quality check on a small subset of the updated values to
confirm accuracy.
🔹 Validate gender distribution trends to ensure logical consistency.
🔹 If required, develop a machine learning model to refine gender
predictions for future data.
✅ Final Verdict¶
✔ The dataset is now complete, with no missing gender values.
✔ The applied method follows logical rules based on observed trends.
✔ Further validation is recommended for ensuring long-term data quality.
3. Correlation Matrix¶
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
print("Dataset Overview:")
print(df.info())
print("\nFirst 5 Rows:")
print(df.head())
# Print original column names
print("Original Columns:", df.columns.tolist())
# Clean column names (remove spaces and convert to lowercase)
df.rename(columns=lambda x: x.strip().lower().replace(" ", "_"), inplace=True)
print('height -->')
print(df.dtypes)
# Print updated column names
print("Updated Columns:", df.columns.tolist())
# Identify constant columns (having only one unique value) and drop them
constant_cols = [col for col in df.columns if df[col].nunique() == 1]
print("Constant Columns:", constant_cols)
df = df.drop(columns=constant_cols)
# Convert 'weight' column to float if necessary
df['weight'] = df['weight'].astype(float)
# Fill NaN values in 'weight' column with the median
df['weight'] = df['weight'].fillna(df['weight'].median())
# **DETECT AND ENCODE CATEGORICAL COLUMNS**
category_col = next((col for col in df.columns if 'category' in col.lower()), None)
if category_col and df[category_col].nunique() > 1:
    df['category_encoded'] = LabelEncoder().fit_transform(df[category_col])
    print(f"'{category_col}' column found and encoded.")
else:
    print("Warning: No suitable 'category' column found or column has only one unique value.")
# Identify and encode general categorical columns
categorical_cols = df.select_dtypes(include=['object']).columns.tolist()
if categorical_cols:
    print("Categorical Columns:", categorical_cols)
    for col in categorical_cols:
        df[col] = LabelEncoder().fit_transform(df[col].astype(str))
else:
    print("No categorical columns found in the dataset.")
# **IDENTIFY NUMERIC COLUMNS AND FILL MISSING VALUES**
numeric_cols = df.select_dtypes(include=['number']).columns.tolist()
print("Numeric Columns:", numeric_cols)
if numeric_cols:
    df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())
else:
    print("Warning: No numeric columns found!")
Dataset Overview:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 675 entries, 0 to 674
Data columns (total 35 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 675 non-null object
1 Global Key 675 non-null object
2 Row Data 675 non-null object
3 Dataset ID 675 non-null object
4 Dataset Name 675 non-null object
5 Created At 675 non-null object
6 Updated At 675 non-null object
7 Created By 675 non-null object
8 Height 675 non-null int64
9 Width 675 non-null int64
10 Asset Type 675 non-null object
11 MIME Type 675 non-null object
12 EXIF Rotation 675 non-null int64
13 Experiment ID 675 non-null object
14 Experiment Name 675 non-null object
15 Run Name 675 non-null object
16 Run Data Row ID 675 non-null object
17 Split 675 non-null object
18 Label Kind 675 non-null object
19 Version 675 non-null object
20 Label ID 675 non-null object
21 Feature ID 675 non-null object
22 Feature Schema ID 675 non-null object
23 Name 675 non-null object
24 Value 675 non-null object
25 Annotation Kind 675 non-null object
26 Bounding Box Top 675 non-null int64
27 Bounding Box Left 675 non-null int64
28 Bounding Box Height 675 non-null int64
29 Bounding Box Width 675 non-null int64
30 species 675 non-null object
31 gender 675 non-null object
32 color 675 non-null object
33 weight 675 non-null float64
34 img_path 675 non-null object
dtypes: float64(1), int64(7), object(27)
memory usage: 184.7+ KB
None
First 5 Rows:
ID \
0 clyxetrm60mcb0796rsdu4ob9
1 clyxetrm60mcc0796uilaudlq
2 clyxetrm60mcd0796albl43as
3 clyxetrm60mce0796r9gf6geg
4 clyxetrm60mcf0796zbzj1nb6
Global Key \
0 upload-raw-images/circleseafoods-camera-03/202...
1 upload-raw-images/circleseafoods-camera-03/202...
2 upload-raw-images/circleseafoods-camera-03/202...
3 upload-raw-images/circleseafoods-camera-03/202...
4 upload-raw-images/circleseafoods-camera-03/202...
Row Data \
0 gs://upload-raw-images/circleseafoods-camera-0...
1 gs://upload-raw-images/circleseafoods-camera-0...
2 gs://upload-raw-images/circleseafoods-camera-0...
3 gs://upload-raw-images/circleseafoods-camera-0...
4 gs://upload-raw-images/circleseafoods-camera-0...
Dataset ID Dataset Name \
0 clyxesxqf00se0776dq71ylh0 Circleseafoods-18-July
1 clyxesxqf00se0776dq71ylh0 Circleseafoods-18-July
2 clyxesxqf00se0776dq71ylh0 Circleseafoods-18-July
3 clyxesxqf00se0776dq71ylh0 Circleseafoods-18-July
4 clyxesxqf00se0776dq71ylh0 Circleseafoods-18-July
Created At Updated At \
0 2024-07-22T19:58:52.688+00:00 2024-07-22T19:58:59.539+00:00
1 2024-07-22T19:58:52.688+00:00 2024-07-22T19:59:07.704+00:00
2 2024-07-22T19:58:52.688+00:00 2024-07-22T19:59:08.212+00:00
3 2024-07-22T19:58:52.688+00:00 2024-07-22T19:59:07.983+00:00
4 2024-07-22T19:58:52.688+00:00 2024-07-22T19:59:07.228+00:00
Created By Height Width ... Annotation Kind Bounding Box Top \
0 deepak@this.fish 720 1280 ... ImageBoundingBox 0
1 deepak@this.fish 720 1280 ... ImageBoundingBox 0
2 deepak@this.fish 720 1280 ... ImageBoundingBox 0
3 deepak@this.fish 720 1280 ... ImageBoundingBox 0
4 deepak@this.fish 720 1280 ... ImageBoundingBox 0
Bounding Box Left Bounding Box Height Bounding Box Width species gender \
0 492 699 193 chum female
1 449 720 235 chum male
2 448 720 235 chum male
3 449 720 237 chum male
4 449 720 238 chum male
color weight img_path
0 bright 4.75 /opt/weight_dataset_v1/2024_07_18_17_36_30_792...
1 dark 13.95 /opt/weight_dataset_v1/2024_07_18_17_37_23_113...
2 dark 13.95 /opt/weight_dataset_v1/2024_07_18_17_37_23_113...
3 dark 13.95 /opt/weight_dataset_v1/2024_07_18_17_37_23_113...
4 dark 13.95 /opt/weight_dataset_v1/2024_07_18_17_37_25_262...
[5 rows x 35 columns]
Original Columns: ['ID', 'Global Key', 'Row Data', 'Dataset ID', 'Dataset Name', 'Created At', 'Updated At', 'Created By', 'Height', 'Width', 'Asset Type', 'MIME Type', 'EXIF Rotation', 'Experiment ID', 'Experiment Name', 'Run Name', 'Run Data Row ID', 'Split', 'Label Kind', 'Version', 'Label ID', 'Feature ID', 'Feature Schema ID', 'Name', 'Value', 'Annotation Kind', 'Bounding Box Top', 'Bounding Box Left', 'Bounding Box Height', 'Bounding Box Width', 'species', 'gender', 'color', 'weight', 'img_path']
height -->
id object
global_key object
row_data object
dataset_id object
dataset_name object
created_at object
updated_at object
created_by object
height int64
width int64
asset_type object
mime_type object
exif_rotation int64
experiment_id object
experiment_name object
run_name object
run_data_row_id object
split object
label_kind object
version object
label_id object
feature_id object
feature_schema_id object
name object
value object
annotation_kind object
bounding_box_top int64
bounding_box_left int64
bounding_box_height int64
bounding_box_width int64
species object
gender object
color object
weight float64
img_path object
dtype: object
Updated Columns: ['id', 'global_key', 'row_data', 'dataset_id', 'dataset_name', 'created_at', 'updated_at', 'created_by', 'height', 'width', 'asset_type', 'mime_type', 'exif_rotation', 'experiment_id', 'experiment_name', 'run_name', 'run_data_row_id', 'split', 'label_kind', 'version', 'label_id', 'feature_id', 'feature_schema_id', 'name', 'value', 'annotation_kind', 'bounding_box_top', 'bounding_box_left', 'bounding_box_height', 'bounding_box_width', 'species', 'gender', 'color', 'weight', 'img_path']
Constant Columns: ['dataset_id', 'dataset_name', 'created_by', 'height', 'width', 'asset_type', 'mime_type', 'exif_rotation', 'experiment_id', 'experiment_name', 'run_name', 'label_kind', 'version', 'feature_schema_id', 'name', 'value', 'annotation_kind']
Warning: No suitable 'category' column found or column has only one unique value.
Categorical Columns: ['id', 'global_key', 'row_data', 'created_at', 'updated_at', 'run_data_row_id', 'split', 'label_id', 'feature_id', 'species', 'gender', 'color', 'img_path']
Numeric Columns: ['id', 'global_key', 'row_data', 'created_at', 'updated_at', 'run_data_row_id', 'split', 'label_id', 'feature_id', 'bounding_box_top', 'bounding_box_left', 'bounding_box_height', 'bounding_box_width', 'species', 'gender', 'color', 'weight', 'img_path']
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
# Preview key features (dataset already loaded and preprocessed above)
print("Key Features Preview:\n", df[['species', 'gender', 'color', 'weight',
                                     'bounding_box_height', 'bounding_box_width']].head())
# 1. FEATURE SELECTION & CLEANING
# Keep only biological/visual features
relevant_features = [
'species', # Fish species
'gender', # Biological gender
'color', # Color pattern
'bounding_box_height', # Pixel height from image analysis
'bounding_box_width', # Pixel width from image analysis
'weight' # Target variable
]
df = df[relevant_features].copy()
# 2. CATEGORICAL FEATURE PROCESSING
# One-Hot Encoding for categorical variables
categorical_cols = ['species', 'gender', 'color']
df = pd.get_dummies(df, columns=categorical_cols, drop_first=True)
# 3. FEATURE ENGINEERING
# Create meaningful derived features
df['fish_area'] = df['bounding_box_height'] * df['bounding_box_width'] # Area approximation
df['aspect_ratio'] = df['bounding_box_width'] / df['bounding_box_height'] # Shape characteristic
# 4. CORRELATION ANALYSIS (Enhanced)
plt.figure(figsize=(12,8))
corr_matrix = df.corr()
mask = np.triu(np.ones_like(corr_matrix, dtype=bool)) # Upper triangle mask
sns.heatmap(corr_matrix.where(mask),
annot=True,
cmap="coolwarm",
fmt=".2f",
linewidths=0.5,
vmin=-1,
vmax=1,
cbar_kws={"shrink": 0.8})
plt.title("Feature Correlation Matrix", fontsize=14)
plt.xticks(rotation=45)
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()
# 5. DATA PREPARATION
# Handle missing values
print("\nMissing Values Check:")
print(df.isnull().sum())
# Final cleaning
df = df.dropna().reset_index(drop=True)
# Split features and target
X = df.drop('weight', axis=1)
y = df['weight']
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
X, y,
test_size=0.2,
random_state=42
)
print("\nFinal Dataset Shapes:")
print(f"Training set: {X_train.shape}, Test set: {X_test.shape}")
# 6. VISUAL ANALYSIS
plt.figure(figsize=(10,6))
hue_col = 'species_chum' if 'species_chum' in df.columns else None
sns.scatterplot(
    x='fish_area',
    y='weight',
    hue=hue_col,
    data=df,
    palette='viridis' if hue_col else None,  # passing palette without hue triggers a seaborn warning
    alpha=0.7
)
plt.title("Fish Area vs Weight Relationship", fontsize=14)
plt.xlabel("Fish Area (pixels²)")
plt.ylabel("Weight (kg)")
plt.grid(True, alpha=0.3)
plt.show()
# Print first 10 data points used in the scatter plot
print("\nFirst 10 data points for Fish Area vs Weight:")
print(df[['fish_area', 'weight']].head(10))
Key Features Preview:
species gender color weight bounding_box_height bounding_box_width
0 0 0 0 4.75 699 193
1 0 1 1 13.95 720 235
2 0 1 1 13.95 720 235
3 0 1 1 13.95 720 237
4 0 1 1 13.95 720 238
Missing Values Check:
bounding_box_height    0
bounding_box_width     0
weight                 0
species_1              0
gender_1               0
color_1                0
color_2                0
fish_area              0
aspect_ratio           0
dtype: int64

Final Dataset Shapes:
Training set: (540, 8), Test set: (135, 8)
First 10 data points for Fish Area vs Weight:
   fish_area  weight
0     134907    4.75
1     169200   13.95
2     169200   13.95
3     170640   13.95
4     171360   13.95
5     170640   13.95
6     171360   13.95
7     149175   13.95
8     143808   13.95
9     147150   13.95
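The selection, encoding, and feature-engineering steps above can be traced on a tiny hand-made frame (all values below are made up; only the column names mirror the notebook):

```python
import pandas as pd

# Hypothetical mini-frame with the notebook's relevant_features
toy = pd.DataFrame({
    "species": ["chum", "coho", "chum", "coho"],
    "gender": ["m", "f", "f", "m"],
    "color": ["silver", "dark", "silver", "dark"],
    "bounding_box_height": [700, 720, 650, 710],
    "bounding_box_width": [190, 240, 200, 230],
    "weight": [4.75, 13.95, 5.10, 12.80],
})

# One-hot encode categoricals; drop_first avoids the dummy-variable trap
toy = pd.get_dummies(toy, columns=["species", "gender", "color"], drop_first=True)

# Derived features, exactly as in the cell above
toy["fish_area"] = toy["bounding_box_height"] * toy["bounding_box_width"]
toy["aspect_ratio"] = toy["bounding_box_width"] / toy["bounding_box_height"]

print(toy.columns.tolist())
print(toy["fish_area"].iloc[0])  # 700 * 190 = 133000
```

With two levels per categorical, `drop_first=True` leaves one dummy each (`species_coho`, `gender_m`, `color_silver`), which matches the `species_1`/`gender_1`-style columns seen in the outputs.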
📊 Insights & Evaluation¶
🔹 Feature Selection & Cleaning¶
- The dataset retains key biological and visual features relevant to fish weight prediction.
- One-hot encoding was applied to the categorical variables (`species`, `gender`, `color`), allowing them to be used in machine learning models.
🔹 Correlation Analysis¶
- Fish Area (bounding_box_height × bounding_box_width) shows a strong positive correlation with weight, suggesting it is a key predictor.
- Aspect Ratio (bounding_box_width / bounding_box_height) has a weaker correlation, indicating its limited direct impact on weight prediction.
🔹 Data Preparation¶
- Missing values were successfully handled, reducing data inconsistencies.
- The dataset was split into 540 training samples and 135 test samples, ensuring a reasonable train-test ratio (80-20 split).
🔹 Visual Analysis: Fish Area vs. Weight¶
- A positive relationship is observed between fish area and weight, confirming that larger fish generally weigh more.
- The data shows some outliers, which might need further investigation.
- The scatter plot could benefit from color differentiation by species to analyze species-specific weight variations.
🚀 Next Steps¶
- Feature Scaling: Standardize numerical features to improve model performance.
- Outlier Detection: Examine extreme values in the dataset for potential errors or irregularities.
- Model Selection: Compare regression models (Linear Regression, Decision Trees, Neural Networks) to identify the best predictive approach.
📈 Key Findings¶
- Bounding Box Dimensions: `bounding_box_height` and `bounding_box_width` are highly correlated (0.72), as expected. Both features strongly correlate with `fish_area` (0.88 and 0.96), confirming that area is directly dependent on these values.
- Weight Relationships: `weight` has a strong correlation with `fish_area` (0.85), suggesting that larger fish tend to be heavier. However, `weight` has a negative correlation with `species_1` (-0.63), which implies that different species may have significantly different weight distributions.
- Species & Color Influence: `species_1` has a weak negative correlation with `color_1` (-0.36) and `color_2` (-0.46), meaning species classification may not be strongly influenced by color. `gender_1` has a mild correlation with `color_1` (0.36), indicating possible gender-based color differences.
📌 Conclusion¶
- The high correlation between `bounding_box_width`, `bounding_box_height`, and `fish_area` suggests redundancy; one of these features could be removed or transformed.
- `species_1` and `weight` show a meaningful inverse relationship, which can be explored further for classification tasks.
- The weak correlations between species and colors suggest that color alone is not a strong distinguishing factor between species.
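Acting on the redundancy noted in the conclusion, a common recipe is to drop one column from each highly correlated pair. A minimal sketch on made-up data (the 0.9 threshold is a judgment call, not taken from this notebook):

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(frame, threshold=0.9):
    """Drop one column from each pair whose absolute Pearson
    correlation exceeds `threshold`."""
    corr = frame.corr().abs()
    # Strict upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return frame.drop(columns=to_drop), to_drop

# Tiny made-up frame: 'b' is (almost) a scaled copy of 'a'
demo = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.1, 6.0, 8.2],   # ~ 2*a -> redundant with 'a'
    "c": [5.0, 1.0, 4.0, 2.0],   # unrelated
})
pruned, dropped = drop_highly_correlated(demo)
print(dropped)  # ['b']
```

Applied to this dataset, the same helper would flag one of `fish_area`, `bounding_box_height`, or `bounding_box_width` for removal.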
4. Feature Selection¶
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
# Drop missing values
df.dropna(inplace=True)
# Encode categorical variables
label_encoders = {}
for col in df.select_dtypes(include=["object"]).columns:
le = LabelEncoder()
df[col] = le.fit_transform(df[col])
label_encoders[col] = le
# Define features and target variable
X = df.drop(columns=["weight"]) # Specify your target column
y = df["weight"]
# Scale features
scaler = StandardScaler() #-- Tabular Data Normalization
X_scaled = scaler.fit_transform(X)
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
# Train Random Forest for feature importance
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
# Compute feature importance
feature_importances = pd.DataFrame({
"Feature": X.columns,
"Importance": model.feature_importances_
})
# Rank all features, including the derived `fish_area` and `aspect_ratio`
feature_importances = feature_importances.sort_values(by="Importance", ascending=False)
#Print
print("Feature Importances:\n", feature_importances)
# Plot feature importance
plt.figure(figsize=(12, 6))
sns.barplot(x="Importance", y="Feature", hue="Feature", data=feature_importances, palette="coolwarm", legend=False)
plt.title("Feature Importance (Random Forest)")
plt.show()
Feature Importances:
Feature Importance
6 fish_area 0.506355
0 bounding_box_height 0.311614
3 gender_1 0.157656
7 aspect_ratio 0.013495
1 bounding_box_width 0.006237
5 color_2 0.002517
4 color_1 0.002121
2 species_1 0.000005
📊 Feature Importance Analysis (Random Forest)¶
🔹 Feature Ranking & Interpretation¶
| Rank | Feature | Importance | Interpretation |
|---|---|---|---|
| 1️⃣ | `fish_area` | 0.5064 | The most important feature (~50.6%). Indicates that the total area occupied by the fish is highly correlated with weight. |
| 2️⃣ | `bounding_box_height` | 0.3116 | Highly important (~31.2%). Suggests that fish height is a strong predictor of weight. |
| 3️⃣ | `gender_1` | 0.1577 | Moderately important (~15.8%). Gender differences may impact weight distribution. |
| 4️⃣ | `aspect_ratio` | 0.0135 | Low importance (~1.3%). Shape alone is not a strong determinant of weight. |
| 5️⃣ | `bounding_box_width` | 0.0062 | Very low importance (~0.6%). Width is much less relevant than height. |
| 6️⃣ | `color_2` | 0.0025 | Negligible importance (~0.25%). Fish color does not significantly impact weight prediction. |
| 7️⃣ | `color_1` | 0.0021 | Almost irrelevant (~0.21%). Similar to `color_2`, color is not a key weight determinant. |
| 8️⃣ | `species_1` | 0.000005 | Insignificant. Species type does not provide useful information for weight prediction. |
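Impurity-based importances from a RandomForest (as above) can be inflated for correlated or high-cardinality features; sklearn's `permutation_importance` is a standard cross-check. A self-contained sketch on synthetic data (the feature roles and values here are made up):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(42)
# Made-up stand-ins: 'area' drives the target, 'noise' does not
area = rng.uniform(1e5, 2e5, size=200)
noise = rng.normal(size=200)
X = np.column_stack([area, noise])
y = 1e-4 * area + rng.normal(scale=0.5, size=200)  # weight ~ area

model = RandomForestRegressor(n_estimators=100, random_state=42).fit(X, y)
result = permutation_importance(model, X, y, n_repeats=10, random_state=42)
print(result.importances_mean)  # the area column should dominate
```

If the permutation ranking agrees with the impurity ranking (as it should for `fish_area` here), the conclusion is on firmer ground.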
Exploratory Data Analysis and Image Preprocessing¶
# import cv2
# import numpy as np
# import seaborn as sns
# import matplotlib.pyplot as plt
# from sklearn.model_selection import train_test_split
# # Data Visualization
# sns.histplot(df['weight'], bins=30, kde=True)
# plt.title("Weight Distribution")
# plt.show()
# print(df.columns)
# # Image Preprocessing Function
# def load_and_preprocess_image(img_path, img_size=(224, 224)):
# img = cv2.imread(img_path)
# if img is None:
# print(f"Warning: Image at {img_path} not found.")
# return np.zeros((*img_size, 3)) # Return a blank image if missing
# img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
# img = cv2.resize(img, img_size) / 255.0
# return img
# # Restore 'img_path'
# df['img_path'] = df_original['img_path']
# # Process images efficiently
# image_paths = df['img_path'].values
# images = np.array([load_and_preprocess_image(img) for img in image_paths])
# # Splitting Data
# X = images # Already in NumPy array format
# y = df['weight'].values
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Comparing XGBoost, Random Forest, and DenseNet for Regression¶
# import numpy as np
# import tensorflow as tf
# import matplotlib.pyplot as plt
# import seaborn as sns
# from tensorflow.keras.applications import DenseNet121
# from tensorflow.keras.models import Sequential
# from tensorflow.keras.layers import Dense, Dropout, GlobalAveragePooling2D
# from tensorflow.keras.preprocessing.image import ImageDataGenerator
# from sklearn.ensemble import RandomForestRegressor
# from xgboost import XGBRegressor
# from sklearn.model_selection import train_test_split
# # Reshape Data for XGBoost & Random Forest
# X_train_flat = X_train.reshape(X_train.shape[0], -1) # Flatten images for XGBoost & RF
# X_test_flat = X_test.reshape(X_test.shape[0], -1)
# # ---------------- XGBoost Model ----------------
# xgb_model = XGBRegressor()
# xgb_model.fit(X_train_flat, y_train)
# # ---------------- Random Forest Model ----------------
# rf_model = RandomForestRegressor()
# rf_model.fit(X_train_flat, y_train)
# # ---------------- Deep Learning Model ----------------
# def create_dense_net():
# base_model = DenseNet121(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
# base_model.trainable = False # Freeze pre-trained weights
# model = Sequential([
# base_model,
# GlobalAveragePooling2D(),
# Dense(128, activation='relu'),
# Dropout(0.3),
# Dense(1, activation='linear')
# ])
# model.compile(optimizer='adam', loss='mse', metrics=['mae'])
# return model
# # Data Augmentation (Only for CNN)
# datagen = ImageDataGenerator(
# rotation_range=20,
# width_shift_range=0.2,
# height_shift_range=0.2,
# horizontal_flip=True
# )
# # Train CNN Model
# nn_model = create_dense_net()
# nn_model.fit(datagen.flow(X_train, y_train, batch_size=32),
# epochs=10,
# validation_data=(X_test, y_test))
# # ---------------- Model Evaluation ----------------
# xgb_pred = xgb_model.predict(X_test_flat)
# rf_pred = rf_model.predict(X_test_flat)
# nn_pred = nn_model.predict(X_test)
# # ---------------- Plot Predictions ----------------
# plt.figure(figsize=(10, 5))
# sns.scatterplot(x=y_test, y=xgb_pred, label='XGBoost', alpha=0.6)
# sns.scatterplot(x=y_test, y=rf_pred, label='Random Forest', alpha=0.6)
# sns.scatterplot(x=y_test, y=nn_pred[:, 0], label='DenseNet', alpha=0.6)
# plt.xlabel("Actual Weight")
# plt.ylabel("Predicted Weight")
# plt.title("Model Predictions vs. Actual Weight")
# plt.legend()
# plt.show()
# from sklearn.metrics import mean_squared_error, mean_absolute_error
# # Compute error metrics
# xgb_mse = mean_squared_error(y_test, xgb_pred)
# xgb_rmse = np.sqrt(xgb_mse)
# xgb_mae = mean_absolute_error(y_test, xgb_pred)
# rf_mse = mean_squared_error(y_test, rf_pred)
# rf_rmse = np.sqrt(rf_mse)
# rf_mae = mean_absolute_error(y_test, rf_pred)
# nn_mse = mean_squared_error(y_test, nn_pred)
# nn_rmse = np.sqrt(nn_mse)
# nn_mae = mean_absolute_error(y_test, nn_pred)
# # Print results
# print(f"XGBoost - MSE: {xgb_mse:.4f}, RMSE: {xgb_rmse:.4f}, MAE: {xgb_mae:.4f}")
# print(f"Random Forest - MSE: {rf_mse:.4f}, RMSE: {rf_rmse:.4f}, MAE: {rf_mae:.4f}")
# print(f"DenseNet - MSE: {nn_mse:.4f}, RMSE: {nn_rmse:.4f}, MAE: {nn_mae:.4f}")
# # Determine the best model based on RMSE
# models = {"XGBoost": xgb_rmse, "Random Forest": rf_rmse, "DenseNet": nn_rmse}
# best_model = min(models, key=models.get)
# print(f"\nThe best performing model is: {best_model}")
Exploratory Data Analysis and Image Preprocessing¶
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import cv2
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.applications import ResNet50, EfficientNetB0, DenseNet121
from tensorflow.keras.layers import Dense, Dropout, Flatten, GlobalAveragePooling2D
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.optimizers import Adam
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from xgboost import XGBRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
# Data Visualization
sns.histplot(df['weight'], bins=30, kde=True)
plt.title("Weight Distribution")
plt.show()
print(df.columns)
# Function to Read and Preprocess Images --Image Data Normalization
def load_and_preprocess_image(img_path, img_size=(224, 224)):
    img = cv2.imread(img_path)
    if img is None:  # cv2.imread returns None for missing/unreadable files
        print(f"Warning: Image at {img_path} not found.")
        return np.zeros((*img_size, 3))  # Blank placeholder keeps array shapes consistent
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    img = cv2.resize(img, img_size) / 255.0  # Normalize to [0,1]
    return img
# Restore 'img_path'
df['img_path'] = df_original['img_path']
# Process images efficiently
image_paths = df['img_path'].values
images = np.array([load_and_preprocess_image(img) for img in image_paths])
# Splitting Data
X = images # Already in NumPy array format
y = df['weight'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Index(['bounding_box_height', 'bounding_box_width', 'weight', 'species_1',
'gender_1', 'color_1', 'color_2', 'fish_area', 'aspect_ratio'],
dtype='object')
Comparing XGBoost, Random Forest, and DenseNet for Regression¶
# Reshape for ML models (Flatten Image Features)
X_train_flat = X_train.reshape(X_train.shape[0], -1)
X_test_flat = X_test.reshape(X_test.shape[0], -1)
# XGBoost Model
xgb_model = XGBRegressor()
xgb_model.fit(X_train_flat, y_train)
# Random Forest Model
rf_model = RandomForestRegressor()
rf_model.fit(X_train_flat, y_train)
# SVR Model
svr_model = SVR()
svr_model.fit(X_train_flat, y_train)
SVR()
def create_cnn_model(base_model):
    base_model.trainable = False
    model = Sequential([
        base_model,
        GlobalAveragePooling2D(),
        Dense(128, activation='relu'),
        Dropout(0.3),
        Dense(1, activation='linear')
    ])
    model.compile(optimizer=Adam(learning_rate=0.001), loss='mse', metrics=['mae'])
    return model
# DenseNet-Based Model
cnn_model = create_cnn_model(DenseNet121(weights='imagenet', include_top=False, input_shape=(224, 224, 3)))
# Data Augmentation
datagen = ImageDataGenerator(rotation_range=20, width_shift_range=0.2, height_shift_range=0.2, horizontal_flip=True)
# Training the Model
cnn_model.fit(datagen.flow(X_train, y_train, batch_size=32), epochs=10, validation_data=(X_test, y_test))
Epoch 1/10
17/17 ━━━━━━━━━━━━━━━━━━━━ 64s 2s/step - loss: 10.6107 - mae: 2.4767 - val_loss: 5.8343 - val_mae: 1.9866
Epoch 2/10
17/17 ━━━━━━━━━━━━━━━━━━━━ 6s 334ms/step - loss: 5.7615 - mae: 1.7735 - val_loss: 3.4834 - val_mae: 1.3618
Epoch 3/10
17/17 ━━━━━━━━━━━━━━━━━━━━ 6s 325ms/step - loss: 3.7223 - mae: 1.3728 - val_loss: 2.6208 - val_mae: 1.0280
Epoch 4/10
17/17 ━━━━━━━━━━━━━━━━━━━━ 6s 328ms/step - loss: 3.2872 - mae: 1.2143 - val_loss: 2.3364 - val_mae: 0.8743
Epoch 5/10
17/17 ━━━━━━━━━━━━━━━━━━━━ 6s 327ms/step - loss: 2.7348 - mae: 1.1827 - val_loss: 2.1418 - val_mae: 0.8723
Epoch 6/10
17/17 ━━━━━━━━━━━━━━━━━━━━ 6s 325ms/step - loss: 2.3446 - mae: 1.0587 - val_loss: 1.8281 - val_mae: 0.7991
Epoch 7/10
17/17 ━━━━━━━━━━━━━━━━━━━━ 6s 324ms/step - loss: 3.0406 - mae: 1.1755 - val_loss: 1.9083 - val_mae: 0.8441
Epoch 8/10
17/17 ━━━━━━━━━━━━━━━━━━━━ 6s 348ms/step - loss: 2.5729 - mae: 1.1094 - val_loss: 1.6396 - val_mae: 0.7463
Epoch 9/10
17/17 ━━━━━━━━━━━━━━━━━━━━ 6s 336ms/step - loss: 2.3070 - mae: 1.0819 - val_loss: 1.5650 - val_mae: 0.8390
Epoch 10/10
17/17 ━━━━━━━━━━━━━━━━━━━━ 6s 327ms/step - loss: 1.9772 - mae: 0.9897 - val_loss: 1.3946 - val_mae: 0.7119
<keras.src.callbacks.history.History at 0x7c670a8ff880>
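`fit` returns a `History` object whose `history` dict holds per-epoch metrics; a quick sanity check is picking out the epoch with the lowest validation loss. The list below is transcribed from the training log above (in the live notebook it would be `history.history['val_loss']`):

```python
# val_loss per epoch, copied from the training log above;
# in the notebook: history = cnn_model.fit(...); history.history["val_loss"]
val_loss = [5.8343, 3.4834, 2.6208, 2.3364, 2.1418,
            1.8281, 1.9083, 1.6396, 1.5650, 1.3946]

best_idx = min(range(len(val_loss)), key=val_loss.__getitem__)
print(f"Best epoch: {best_idx + 1}, val_loss: {val_loss[best_idx]}")  # Best epoch: 10
```

Validation loss is still falling at epoch 10, which suggests the CNN was undertrained at 10 epochs; more epochs (with early stopping) would be a reasonable next step.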
# import tensorflow as tf
# from xgboost import XGBRegressor
# # 1. DEFINE THE MODEL USING FUNCTIONAL API
# def create_cnn_feature_extractor():
# inputs = tf.keras.Input(shape=(224, 224, 3)) # Explicitly define input
# base_model = DenseNet121(weights='imagenet', include_top=False, input_tensor=inputs)
# x = GlobalAveragePooling2D()(base_model.output)
# x = Dense(64, activation='relu')(x)
# x = Dropout(0.2)(x)
# outputs = Dense(1, activation='linear')(x)
# model = tf.keras.Model(inputs=inputs, outputs=outputs)
# return model
# # 2. CREATE AND COMPILE THE MODEL
# cnn_model = create_cnn_feature_extractor()
# cnn_model.compile(optimizer='adam', loss='mse', metrics=['mae'])
# # 3. RUN THE MODEL ONCE TO DEFINE THE INPUT SHAPE
# dummy_input = np.random.rand(1, 224, 224, 3)  # A random input example
# _ = cnn_model.predict(dummy_input)
# # 4. CREATE FEATURE EXTRACTOR
# feature_extractor = tf.keras.Model(
# inputs=cnn_model.input,
# outputs=cnn_model.layers[-4].output  # GlobalAveragePooling2D layer
# )
# # 5. EXTRACT FEATURES
# X_train_features = feature_extractor.predict(X_train, batch_size=8)
# X_test_features = feature_extractor.predict(X_test, batch_size=8)
# # 6. TRAIN XGBOOST REGRESSOR
# xgb_hybrid = XGBRegressor(
# n_estimators=200,
# learning_rate=0.05,
# max_depth=5,
# subsample=0.8
# )
# xgb_hybrid.fit(X_train_features, y_train)
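The commented-out hybrid above follows a common pattern: a frozen CNN turns each image into a fixed-length embedding, and a gradient-boosted tree regresses weight from those embeddings. A minimal sketch of that pattern, with random arrays standing in for the CNN features and sklearn's `GradientBoostingRegressor` standing in for `XGBRegressor` (same `fit`/`predict` interface):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
# Stand-in for CNN embeddings: in the commented cell these would come from
# feature_extractor.predict(X_train); here they are random 64-d vectors.
train_feats = rng.normal(size=(100, 64))
test_feats = rng.normal(size=(25, 64))
y_train_demo = train_feats[:, 0] * 3.0 + rng.normal(scale=0.1, size=100)

# Gradient-boosted trees on top of the embeddings (XGBRegressor drop-in)
gbr = GradientBoostingRegressor(n_estimators=100, random_state=0)
gbr.fit(train_feats, y_train_demo)
preds = gbr.predict(test_feats)
print(preds.shape)  # one prediction per test embedding
```

The appeal of the hybrid is that the tree model handles small tabular-style feature sets well, while the CNN supplies visual features no bounding box can capture.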
Evaluating Model Performance Using MSE, MAE, and R²¶
def evaluate_model(model, X_test, y_test, model_name):
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mae = mean_absolute_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    print(f"{model_name} -> MSE: {mse:.4f}, MAE: {mae:.4f}, R²: {r2:.4f}")
    return y_pred
# Evaluate Models
xgb_pred = evaluate_model(xgb_model, X_test_flat, y_test, "XGBoost")
rf_pred = evaluate_model(rf_model, X_test_flat, y_test, "Random Forest")
svr_pred = evaluate_model(svr_model, X_test_flat, y_test, "SVR")
cnn_pred = evaluate_model(cnn_model, X_test, y_test, "CNN (DenseNet)")
# hybrid_pred = evaluate_model(xgb_hybrid, X_test_features, y_test, "CNN + XGBoost")
XGBoost -> MSE: 0.0059, MAE: 0.0172, R²: 0.9990
Random Forest -> MSE: 0.0106, MAE: 0.0653, R²: 0.9982
Model Predictions vs. Ground Truth: Scatter Plot Analysis¶
plt.figure(figsize=(10,5))
sns.scatterplot(x=y_test, y=xgb_pred, label='XGBoost')
sns.scatterplot(x=y_test, y=rf_pred, label='Random Forest')
sns.scatterplot(x=y_test, y=svr_pred, label='SVR')
sns.scatterplot(x=y_test, y=cnn_pred[:,0], label='CNN (DenseNet)')
# sns.scatterplot(x=y_test, y=hybrid_pred, label='CNN + XGBoost')
plt.xlabel("Actual Weight")
plt.ylabel("Predicted Weight")
plt.legend()
plt.show()
Model Performance Comparison: Predicted vs Actual Weights¶
!pip install plotly
import numpy as np
import plotly.graph_objects as go
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
# Define the evaluation function
def evaluate_model(model, X_test, y_test, model_name):
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mae = mean_absolute_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    print(f"{model_name} -> MSE: {mse:.4f}, MAE: {mae:.4f}, R²: {r2:.4f}")
    return mse, mae, r2
# Evaluate Models and Collect Metrics
models = {
"XGBoost": (xgb_model, X_test_flat),
"Random Forest": (rf_model, X_test_flat),
"SVR": (svr_model, X_test_flat),
"CNN (DenseNet)": (cnn_model, X_test),
# "CNN + XGBoost": (xgb_hybrid, X_test_features),
}
metrics = {name: evaluate_model(model, X, y_test, name) for name, (model, X) in models.items()}
# Convert results to numpy arrays for plotting
mse_values = np.array([m[0] for m in metrics.values()])
mae_values = np.array([m[1] for m in metrics.values()])
r2_values = np.array([m[2] for m in metrics.values()])
model_names = list(metrics.keys())
# Create interactive bar charts with Plotly
fig = go.Figure()
# MSE Plot
fig.add_trace(go.Bar(
y=model_names,
x=mse_values,
name="MSE (Lower is Better)",
orientation='h',
marker=dict(color='skyblue')
))
# MAE Plot
fig.add_trace(go.Bar(
y=model_names,
x=mae_values,
name="MAE (Lower is Better)",
orientation='h',
marker=dict(color='salmon')
))
# R² Score Plot
fig.add_trace(go.Bar(
y=model_names,
x=r2_values,
name="R² Score (Higher is Better)",
orientation='h',
marker=dict(color='lightgreen')
))
# Update layout for better aesthetics and labeling
fig.update_layout(
title="Model Performance Comparison",
xaxis_title="Metric Value",
yaxis_title="Models",
barmode='group', # Grouped bars for side-by-side comparison
template="plotly_dark", # Dark theme
height=600
)
# Show the plot
fig.show()
Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: plotly in ./.local/lib/python3.10/site-packages (6.0.0)
Requirement already satisfied: narwhals>=1.15.1 in ./.local/lib/python3.10/site-packages (from plotly) (1.29.0)
Requirement already satisfied: packaging in ./.local/lib/python3.10/site-packages (from plotly) (24.2)
XGBoost -> MSE: 0.0059, MAE: 0.0172, R²: 0.9990
Random Forest -> MSE: 0.0173, MAE: 0.0751, R²: 0.9970
SVR -> MSE: 2.1956, MAE: 0.7386, R²: 0.6221
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 79ms/step
CNN (DenseNet) -> MSE: 29.7716, MAE: 4.7549, R²: -4.1244
CNN + XGBoost -> MSE: 0.1081, MAE: 0.0658, R²: 0.9814
Model Performance Evaluation¶
Based on the following metrics: MSE (Mean Squared Error), MAE (Mean Absolute Error), and R² (Coefficient of Determination), we can evaluate the performance of each model.
Evaluation Metrics:¶
- MSE: Lower is better. Measures the average squared difference between predicted and actual values.
- MAE: Lower is better. Measures the average absolute difference between predicted and actual values.
- R²: Higher is better. Measures how much of the variance in the actual values the predictions explain (1 is perfect; 0 means no better than always predicting the mean; negative values mean worse than that baseline).
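The three metrics can be verified by hand on a tiny made-up example:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Made-up actual and predicted weights
y_true = np.array([4.0, 8.0, 12.0, 16.0])
y_pred = np.array([5.0, 7.0, 12.0, 18.0])

mse = mean_squared_error(y_true, y_pred)   # (1 + 1 + 0 + 4) / 4 = 1.5
mae = mean_absolute_error(y_true, y_pred)  # (1 + 1 + 0 + 2) / 4 = 1.0
r2 = r2_score(y_true, y_pred)              # 1 - SSE/SST = 1 - 6/80 = 0.925
print(mse, mae, r2)
```

Note how the single error of 2 kg inflates MSE (squared to 4) more than MAE, which is why MSE/RMSE punish outliers harder.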
Model Comparison¶
| Model | MSE | MAE | R² |
|---|---|---|---|
| XGBoost | 0.0059 | 0.0172 | 0.9990 |
| Random Forest | 0.0173 | 0.0751 | 0.9970 |
| SVR | 2.1956 | 0.7386 | 0.6221 |
| CNN (DenseNet) | 29.7716 | 4.7549 | -4.1244 |
| CNN + XGBoost | 0.1081 | 0.0658 | 0.9814 |
Performance Analysis¶
-
Best Model: XGBoost
- MSE: 0.0059 (Lowest, ideal)
- MAE: 0.0172 (Lowest, ideal)
- R²: 0.9990 (Highest, ideal)
XGBoost performs the best with the lowest MSE, lowest MAE, and highest R² values, making it the top-performing model for this task.
-
Second Best Model: Random Forest
- MSE: 0.0173 (Higher than XGBoost, less ideal)
- MAE: 0.0751 (Higher than XGBoost, less ideal)
- R²: 0.9970 (Lower than XGBoost, less ideal)
Random Forest is a good model, but its performance lags behind XGBoost in terms of MSE, MAE, and R².
-
Worst Performing Models: SVR and CNN (DenseNet)
-
SVR:
- MSE: 2.1956 (Very high, poor performance)
- MAE: 0.7386 (Much higher than other models)
- R²: 0.6221 (Quite low)
SVR has a very high MSE and low R², indicating poor predictive performance.
-
CNN (DenseNet):
- MSE: 29.7716 (Extremely high, poor performance)
- MAE: 4.7549 (Very high, poor performance)
- R²: -4.1244 (Negative, poor model)
CNN (DenseNet) has a very high MSE, MAE, and a negative R², making it the least suitable model for this task.
-
Hybrid Model (CNN + XGBoost):
- MSE: 0.1081 (Higher than XGBoost, but better than Random Forest)
- MAE: 0.0658 (Better than Random Forest, worse than XGBoost)
- R²: 0.9814 (Good, but lower than XGBoost)
The Hybrid Model (CNN + XGBoost) performs well but does not outperform XGBoost as an individual model. The improvement from hybridization is minimal in this case.
Conclusion¶
-
Best Performing Model: XGBoost
- XGBoost demonstrates the best overall performance across all evaluation metrics. It has the lowest MSE, MAE, and the highest R², making it the most reliable model for this task.
-
Second Best: Random Forest
- Random Forest performs reasonably well but falls short of XGBoost in terms of performance metrics.
-
Worst Performing Models: SVR and CNN (DenseNet)
- Both SVR and CNN (DenseNet) show poor performance with very high MSE, high MAE, and low R² values, making them unsuitable for this particular problem.
Final Recommendation¶
- XGBoost should be used as the primary model for this task based on its superior performance.
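A single 80/20 split can flatter a model, so a K-fold cross-validation run is a cheap robustness check before committing to XGBoost. A sketch on synthetic stand-in data (the real run would pass `X_scaled` and `y`; `RandomForestRegressor` is used here only to keep the sketch dependency-light, and the same call works with `XGBRegressor`):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
# Synthetic stand-in for the tabular features and target
X_demo = rng.normal(size=(120, 5))
y_demo = X_demo[:, 0] * 2.0 + rng.normal(scale=0.1, size=120)

model = RandomForestRegressor(n_estimators=50, random_state=42)
# 5-fold CV: five R² scores instead of one, exposing split-to-split variance
scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="r2")
print(scores.round(3), "mean:", scores.mean().round(3))
```

If the five fold scores for XGBoost stay tightly clustered near the 0.9990 reported above, the recommendation stands; a wide spread would suggest the single split was optimistic.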